This course covers the standard analysis of a single-cell RNA-seq dataset using the ASAP web tool. It follows to some extent the standard tutorial from Seurat, which we encourage readers to consult if they want to dive deeper into single-cell data analysis using R.
In this course, we will use an example data set consisting of 7,000 cells from the Fly Cell Atlas repository. These cells all come from the Body part, which was dissociated as a whole and sequenced using 10x Genomics technology over 10 batches, mixing males and females.
This set of 7,000 cells was carefully selected/filtered from the original Body dataset, which consists of 96,926 cells. In particular, they were all selected from a single batch, to avoid the need for batch effect correction or integration methods. They belong to a certain number of differently annotated cell types.
The goal of this exercise session is to:
On ASAP, you can create a project without registering using the "New sandbox project" button on the top bar. Note that you can create only one sandbox project per IP (i.e. if you create a new one, it will be overridden), and sandbox projects are automatically deleted after 48 hours.
You can also register to ASAP using the Login button on the top bar. Registered users don't have project number limitations, and projects are stored without any time limit. Registered users can aso share their projects with other registered users, for collaborative analysis and visualization.
You can either download the dataset from the GitHub repository: data.CAS.drosophila.melanogaster.txt.gz and then upload it to the ASAP website, or simply copy/paste the URL into the ASAP URL field, so that the ASAP server downloads it directly from its location.
Note: You don’t need to unzip it before uploading to ASAP.
Here, on ASAP, you should see something like this:
Cell_1 Cell_2 Cell_3 Cell_4 Cell_5 Cell_6 Cell_7 Cell_8 Cell_9 Cell_10 ...
128up 0 0 0 0 0 0 0 0 0 0 ...
14-3-3epsilon 1 1 0 0 0 0 0 0 0 0 ...
14-3-3zeta 2 2 1 2 1 3 0 0 2 2 ...
140up 0 0 0 0 0 0 0 0 0 0 ...
18SrRNA-Psi:CR41602 0 0 0 0 0 0 0 0 0 0 ...
18w 0 0 0 0 0 0 0 1 1 0 ...
... ... ... ... ... ... ... ... ... ... ...
Question: What do you see in this file? Rows? Columns? Headers? Separator? First column? Values? Does it look correctly parsed? It’s important to know how the file is formatted, to be able to parse it correctly in the next step.
You can try playing with the Parsing parameters to see what changes (Delimiter, Gene name column, Header, etc.), but in principle you should obtain a matrix of 12,818 genes and 7,000 cells.
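Conceptually, the parsing step boils down to reading the header row as cell names and the first column as gene names. Here is a minimal sketch in Python (not ASAP's actual implementation), on a made-up two-gene excerpt of the file:

```python
import csv
import io

# Sketch of what matrix parsing does, on a made-up two-gene excerpt:
# the header row holds cell names, the first column holds gene names,
# and the remaining values are raw counts (tab-delimited here).
raw = "gene\tCell_1\tCell_2\n14-3-3zeta\t2\t2\n18w\t0\t1\n"

reader = csv.reader(io.StringIO(raw), delimiter="\t")
header = next(reader)          # ['gene', 'Cell_1', 'Cell_2']
cells = header[1:]             # cell names
genes, counts = [], []
for row in reader:
    genes.append(row[0])                      # gene name column
    counts.append([int(x) for x in row[1:]])  # count values

# with the full file, you would get 12,818 genes x 7,000 cells
```

This is why getting the delimiter, header, and gene name column right matters: with the wrong settings, gene names end up parsed as counts or vice versa.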
As you can see from the previous screen, you can load compressed (.gz) text files in ASAP, but many other file formats are also supported.
Here, the data is in text format, and it should be displayed with a TXT logo next to the file name on the "Create project" page after you upload the file, if recognized as such.
Now you can also fill in the upper part of the "New project" page, i.e.:
Then you can click on the "Create Project" button on the bottom left of your screen to run the parsing with the selected parameters.
It will create a project page (check your URL, it should change to something like https://asap.epfl.ch/projects/xxxxxx). All projects are private by default, but you can make them public with a unique URL.
You should see a "Parsing" card, which tells you that the parsing step is running. It should take about 1 to 1.5 minutes.
While the parsing step is running, you can investigate the left panel, which contains all the steps of the single-cell analysis pipeline that we will run today.
Here we are in the "Report" part of the pipeline, which shows a summary of the whole project and will populate with different information over time. This is also where you can export your project outside of ASAP, either as a Loom file or an h5ad file.
Once the parsing is over, clicking on the result card (from the Report page) should automatically bring you to the "Pre-treatment > Parsing" step on the left panel, displaying a result card.
Note: If the parsing step failed, displaying an error message, it probably means that you did not set the parameters correctly in the previous step. In this case, you should start again by creating a new project.
If the parsing passed, then you should see an output summary like the following:
It gives you the following information:
Note: You can now check the "Metadata & Expression matrices" section of the left panel. It lets you "Import metadata" in case you want to add more information on the genes or the cells, e.g. sex, batch, condition, etc. The other tab shows the content of the ASAP objects that you work on within the project, i.e. all automatically generated metadata (_Detected_Genes, _Mitochondrial_Content, ...), metadata that you've manually imported, and metadata that will be added during the pipeline (clustering, UMAP, etc.). The "Details" button gives you more information / plots on the corresponding metadata, and lets you download it as a TSV or JSON file.
As mentioned before, if the genes were correctly recognized by our Ensembl database, then a "_Mitochondrial_Content" metadata is automatically created, which quantifies the mitochondrial content of each cell. This is useful because usually stressed or dying cells (to be filtered out) will exhibit higher mitochondrial content.
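As a sketch of what this metric represents: it is the fraction of each cell's counts that comes from mitochondrial genes. The example below assumes Drosophila mitochondrial genes are identified by an "mt:" name prefix; the gene names and count values are made up for illustration:

```python
# Sketch of a per-cell mitochondrial content metric: sum of counts for
# mitochondrial genes divided by the cell's total counts. The "mt:"
# prefix convention and all values here are illustrative assumptions.
counts = {
    "mt:CoI": [5, 40],    # counts in Cell_1, Cell_2
    "mt:ND5": [3, 30],
    "RpL32":  [90, 20],
    "Act5C":  [2, 10],
}
n_cells = 2
totals = [sum(counts[g][i] for g in counts) for i in range(n_cells)]
mito = [sum(counts[g][i] for g in counts if g.startswith("mt:"))
        for i in range(n_cells)]
mito_ratio = [m / t for m, t in zip(mito, totals)]
# here Cell_2 has a much higher mitochondrial ratio (0.70 vs 0.08),
# which would flag it as potentially stressed or dying
```

In the real pipeline, the mitochondrial genes are identified from the Ensembl annotation rather than a name prefix.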
You can go to the "Pre-treatment > Cell filtering > New cell filtering" section. It should show different filtering thresholds as well as QC plots.
Note: Usually, in a first pass, we keep all cells except the most obvious outliers, until the UMAP/clustering step. If we then find bad-quality clusters, we go back to the Cell Filtering step and change the thresholds. So you could first deactivate all the filtering options, and just check the QC plots.
For example, you could check the mitochondrial ratio distribution:
Question: Do you see any outlier cells, or cells with lower quality?
Note: This mimics the behaviour in Seurat, with similar plots for QC:
VlnPlot(data.asap, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
Once you've checked all QC metrics and adjusted the filtering parameters, then you can click on the "Filter" button to run the filtering step.
It will generate the same kind of cards that you've seen before. In particular, the "cells:" number should now be lower, corresponding to the new number of cells in your ASAP object after your filtering options were applied.
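Under the hood, threshold-based cell filtering boils down to keeping the cells whose QC metrics fall within the chosen bounds. A minimal sketch (the thresholds and QC values below are arbitrary examples, not ASAP's defaults):

```python
# Sketch of threshold-based cell filtering: keep cells whose number of
# detected genes and mitochondrial ratio fall within chosen bounds.
# Cell names, values, and thresholds are arbitrary examples.
cells = {
    "Cell_1": {"n_genes": 1200, "mito": 0.03},
    "Cell_2": {"n_genes": 150,  "mito": 0.02},   # too few detected genes
    "Cell_3": {"n_genes": 2400, "mito": 0.45},   # stressed/dying cell?
}
min_genes, max_mito = 200, 0.20
kept = [c for c, qc in cells.items()
        if qc["n_genes"] >= min_genes and qc["mito"] <= max_mito]
# only Cell_1 survives both filters
```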
We saw that cells aggregate different numbers of reads. This is an issue for the next steps, as gene expression values will not be directly comparable between two cells.
Note: If we plotted the sum of counts per cell for the first 20 cells, it would look like this:
To solve this issue, we normalize the data. By default, ASAP uses the default method from Seurat, called LogNormalize, which normalizes the data to some scaling factor (default = 10,000): for each cell, the counts are divided by the total number of reads for that cell, and then multiplied by the scale.factor. The data is then natural-log transformed with log(1 + x).
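The LogNormalize formula described above fits in a few lines of Python (the counts below are a toy vector for a single cell):

```python
import math

# Sketch of Seurat's LogNormalize on a toy count vector for one cell:
# divide by the cell's total counts, multiply by scale.factor (10,000),
# then take the natural log of (1 + x). Counts are made up.
counts = [2, 0, 8]                  # raw UMI counts for one toy cell
total = sum(counts)                 # 10 reads in this cell
scale_factor = 10_000

normalized = [math.log1p(c / total * scale_factor) for c in counts]
# zero counts stay zero; non-zero counts become comparable across cells
```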
Now you can run the normalization using the "Pre-treatment > Normalization > New normalization" step. Then you will need to select a matrix on which to apply the normalization, i.e. either the full initial matrix or the filtered matrix. Pick the filtered matrix (you can identify them by the step, or by the number of cells on the card) and click the "Normalize" button.
Note: Now, if we check the sum of gene expression for each cell, it looks like this:
Question: Can you comment on the y-axis values after correction?
We calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e. they are highly expressed in some cells and lowly expressed in others). These are likely to be the most interesting genes, since non-varying genes (even if highly expressed) do not bring any information on the heterogeneity / segregation between the cell types.
Additionally, in some steps, restricting the calculation to only the HVGs allows for:
So you can now run the HVG method using the "Pre-treatment > HVG > New HVG" step. Three methods are available: Dispersion, Mean.var.plot, and vst.
We will use the "vst" method, which is the default method in Seurat.
Note: ASAP allows you to run multiple methods at each step, or the same method with different parameters, in order to compare them. Then, in the later steps, you can choose which one you prefer to use.
We again need to pick an "Input Matrix" for this step. This time we will take the matrix which is the output of the previous step, i.e. the Normalized matrix.
Note: The default number of features to keep (2000) is arbitrary, and can be tuned after looking at the data heterogeneity (UMAP/clustering). Usually 2000 is a good number, and is the default for most pipelines (Seurat, scanpy).
Usually, results are displayed as a scatter plot of gene expression vs gene variance. Indeed, one would expect that, because of technical bias, the more lowly expressed a gene is, the more variable it appears (noise).
The vst method standardizes the variance across expression levels, which allows for a common threshold regardless of expression level. It then takes the top 2,000 genes (by default, but this can be tuned) by standardized variance.
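To give the intuition (this is NOT Seurat's exact vst implementation, which fits a loess trend of variance vs mean), the idea is to score each gene by how much its variance exceeds what its expression level predicts, and then keep the top-k. A deliberately simplified sketch on made-up counts:

```python
# Very simplified HVG sketch (NOT Seurat's exact vst): score each gene
# by variance relative to its mean expression, then keep the top-k.
# Using the raw mean as the expected-variance trend is a crude
# stand-in for the loess fit; all counts are made up (4 genes x 4 cells).
counts = {
    "geneA": [0, 0, 10, 10],   # bimodal -> highly variable, informative
    "geneB": [5, 5, 5, 5],     # highly expressed but flat -> uninformative
    "geneC": [0, 1, 0, 1],
    "geneD": [0, 0, 0, 0],
}

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# variance relative to what the mean predicts (crude trend assumption)
scores = {g: var(v) / (mean(v) + 1e-9) for g, v in counts.items()}
top2 = sorted(scores, key=scores.get, reverse=True)[:2]
# geneA wins despite geneB's high expression, because geneB never varies
```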
The result card will display a little icon, indicating that clicking on the card shows the expression vs standardized variance plot.
ASAP merges the Scaling and RegressOut functions into a single function, similar to the latest versions of Seurat. It may sound a bit confusing, but it makes sense in terms of the order of operations: for example, it prevents the user from performing RegressOut on scaled data, which would be wrong.
By default, we can simply scale the data and not regress out any covariate, unless we know of important covariates that are present in our data and can prevent us from seeing the desired biological signal. For example, if the data was processed in batches, it could be useful to regress out these batches, to avoid this signal remaining in the final dataset.
Here, to simply scale the data, we can run the "Pre-treatment > Scaling" step on ASAP, simply selecting the "Scaling [Seurat]" method and the correct "Input data" i.e. the Normalization matrix without any "Optional parameters".
Let’s take an example of a potential covariate: suppose we have 3 batches in our dataset (e.g. three 10x runs that were merged together, or three RNA extractions merged into one 10x run). It could be that the batch signal is stronger than the biological signal (cell types), so that the UMAP/t-SNE/PCA plot shows 3 clusters perfectly matching our batches, instead of separating the cells by cell type. In this case, the batch covariate needs to be regressed out.
If you don’t know of any potential covariate, or you know some (like sex) but don’t believe they should be stronger than your main signal, then you can proceed without regressing out anything, go on to dimension reduction / clustering / differential expression, and eventually “ping-pong” back to this step to filter out any covariate you find.
If we want to regress out one (or multiple) covariates, we can use the "Optional parameters" section of the "New scaling" window and pick some covariates (e.g. regressing out “percent.mt”). In our case, we probably don’t need to do it.
Note 1: To be regressed out, the metadata needs to be in your ASAP object. It can be automatically computed during parsing or a later step, or manually imported using "Metadata and Expression matrices > Import metadata".
Note 2: The “regress out” step performs a simple linear regression and returns the residuals, which are then scaled.
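A minimal sketch of this regress-out idea, for a single gene and a single covariate (all values are made up; in this toy case the covariate explains the expression perfectly, so the residuals are zero):

```python
# Sketch of "regress out": fit a simple linear regression of one gene's
# expression on a covariate (e.g. percent.mt), keep the residuals.
# The residuals would then be scaled (z-scored). All values are made up.
covariate = [0.0, 1.0, 2.0, 3.0]    # e.g. mitochondrial ratio per cell
expr      = [1.0, 3.0, 5.0, 7.0]    # expression fully driven by covariate

n = len(expr)
mx = sum(covariate) / n
my = sum(expr) / n
beta = sum((x - mx) * (y - my) for x, y in zip(covariate, expr)) \
       / sum((x - mx) ** 2 for x in covariate)
alpha = my - beta * mx
residuals = [y - (alpha + beta * x) for x, y in zip(covariate, expr)]
# here the covariate explains everything, so the residuals are ~0:
# after regression, this gene carries no covariate-driven signal
```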
This step is crucial for the remainder of the ASAP pipeline, since many of the downstream tools will use the PCA as input (UMAP, t-SNE, clustering).
By default, the PCA is run only on the most meaningful features (the HVGs), both to remove noise and to speed up the computation.
To run it, you need to select the "Pre-treatment > PCA" step from the left panel, and then click on "New PCA". Then you need to provide a scaled matrix, and a list of variable features (from the HVG step).
You can also select a certain number of principal components to compute. In this example, we will stick with 50.
Note: Selecting the optimal number of principal components (PCs) for downstream analysis can be tedious. In principle, the number of PCs should scale with the heterogeneity of your dataset, i.e. the more clusters/sub-populations you expect to see, the more PCs you should select. A more rigorous solution (that we will not describe here) is to use JackStraw plots. This is not yet integrated in the ASAP pipeline though, because it is very computationally intensive.
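To illustrate what PCA extracts, namely the directions of maximal variance, here is a toy sketch that finds the first principal component of a tiny made-up "scaled expression" matrix via power iteration on the covariance matrix. Real pipelines use optimized SVD solvers; this is only meant to show the idea:

```python
# Toy PCA sketch: 4 cells x 2 "genes", already centered/scaled.
# The two genes are almost perfectly correlated, so the first PC
# should point along the (1, 1) diagonal. All values are made up.
X = [[-3.0, -2.9], [-1.0, -1.1], [1.0, 1.1], [3.0, 2.9]]

# 2x2 covariance matrix of the centered data
n = len(X)
cov = [[sum(r[i] * r[j] for r in X) / (n - 1) for j in range(2)]
       for i in range(2)]

# power iteration: repeatedly apply cov to a vector and renormalize;
# it converges to the dominant eigenvector (the first PC direction)
v = [1.0, 0.0]
for _ in range(100):
    w = [cov[0][0] * v[0] + cov[0][1] * v[1],
         cov[1][0] * v[0] + cov[1][1] * v[1]]
    norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
    v = [w[0] / norm, w[1] / norm]
# v is close to (0.707, 0.707): both genes load equally on PC1
```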
Once the PCA is computed, we can visualize it using the "Visualization" step on the left panel.
Note: Since it's computed for 50 dimensions, you can even plot it in 3D. However, t-SNE and UMAP plots are usually optimized for 2D, and thus will only be visualizable in 2D.
Usually, for single-cell RNA-seq datasets, PCA is not very good at showing the complexity of the data in its first two components (as we can see in the previous plot, and in general it’s often far worse than this). Therefore, we are more likely to use dimension reduction techniques that are more tailored towards 2D visualization, such as t-SNE or UMAP. Both of these methods run directly on the PCA (not on the original scaled matrix).
First, we’ll run a UMAP. We can go to the "Pre-treatment > UMAP" step on the left panel, and click on the "New UMAP" button. The UMAP is usually run on the PCA embeddings, and you can pick any number of PCs from the PCA that you've already computed (so 1 to 50 in our case). But we will stick with 50 in our example.
Once the UMAP is computed, we can visualize it using the "Visualization" step on the left panel, similarly to the previous step.
Similarly to the previous step, we can run and visualize our data using a t-SNE plot.
Note: I would personally rely more on the UMAP results than the t-SNE, because UMAP better preserves inter-cluster distances. But it’s open to interpretation, and different bioinformaticians will prefer one or the other (it can also depend on the dataset you are studying). Also remember that both of these methods are heuristics, i.e. if no random seed is set, running them twice can generate different results.
In the visualization panel, there are many options that you can tune/modify if you click on the "Controls" button on the bottom right:
Note 1: The colors are subdivided into Continuous (e.g. gene expression or any continuous metadata values such as depth) or Discrete (for categorical metadata, such as clustering)
Note 2: If the visualization is too zoomed in, or you want to reset the axes, you can click on the "home/house" button from the plot tools, or simply double click on the plot.
For example, a good practice is to always verify that a cluster is not formed from low count / high mito% cells i.e. potentially bad quality cells. We can color our UMAP using the "Controls" panel, and then the "Coloring > Continuous > Data type = Numerical continuous metadata > Data source = _Depth"
Doing this, we see the reverse, i.e. that one cluster seems to have a higher number of UMIs. However, this could be biological: bigger cells usually have more RNA content, so certain cell types may aggregate more UMIs. So we keep it as such for now, and we'll see if there are specific markers for this cluster that could explain this behaviour.
You probably noticed that the previous plots did NOT show any clustering of the data (all cells are the same color). This is because PCA, t-SNE and UMAP are NOT clustering methods, they are dimension reduction methods. Now, we need to run a clustering method to be able to visualize predicted clusters.
This is where things get very arbitrary, because a human viewer would like the clusters to match the UMAP/t-SNE representation. However, clustering methods are usually run on the PCA data, which contains more information than the reduced UMAP/t-SNE, and thus the clusters may not match perfectly what we see in the UMAP/t-SNE. It would be wrong, though, to run the clustering on the UMAP / t-SNE coordinates: these two methods do not preserve the data structure and simplify the data down to only 2 dimensions, because they are only meant for visualization.
Now, we will run the clustering from the "Clustering" step on the left panel. We click "New clustering" and pick our PCA, with 50 PCs. To start I will pick a resolution of 0.5.
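Graph-based clustering methods of the kind used in Seurat-style pipelines (Louvain/Leiden) start from a k-nearest-neighbour graph built on the PCA embeddings, on which communities are then detected at the chosen resolution. A toy sketch of the kNN-graph construction step (made-up 2-D "PCA" coordinates; real data would use all selected PCs):

```python
# Toy kNN-graph construction, the first step of graph-based clustering:
# connect each cell to its k nearest neighbours in PCA space.
# Two obvious groups of 3 points each; coordinates are made up.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # one tight group
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]   # another tight group

def dist2(a, b):
    # squared Euclidean distance (no need for sqrt to rank neighbours)
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

k = 2
knn = {i: sorted(range(len(points)),
                 key=lambda j: dist2(points[i], points[j]))[1:k + 1]
       for i in range(len(points))}
# each point's neighbours stay within its own group, so community
# detection on this graph would recover the two clusters; the
# resolution parameter then controls how finely communities are split
```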
Once computed, we can go to the "Visualization" panel similarly to the previous sections. Now, we will color the UMAP using the "Coloring > Discrete" tab and selecting the clustering we performed.
Note: Here I've computed the clustering with a resolution of 0.5, which produces 8 clusters. However, if I increase the resolution (e.g. 0.8), then I will get more clusters (in this case 11). Choosing the optimal number of clusters is somewhat arbitrary and usually relies on the detection of specific marker genes, i.e. if a cluster shares all its marker genes with another, then it should be merged with it.
Question: Are there 8 clusters in our data? It’s difficult to know if this is the optimal number for our dataset, so you can play with the resolution parameter to change the behaviour. Of note, another way to know whether 2 clusters can be merged together (or not) is to check the marker genes of these clusters (next step).
Marker genes are the top up-regulated genes of one cluster (vs all other cells). They are usually computed using a Wilcoxon test (but other methods exist).
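The intuition behind the Wilcoxon/Mann-Whitney test can be sketched for a single gene (the expression values below are made up; in practice you would use ASAP's DE step, Seurat's FindMarkers, or scipy.stats.mannwhitneyu rather than computing this by hand):

```python
# Sketch of the marker-gene idea: for one gene, compare its normalized
# expression in the cluster of interest vs all other cells with a
# rank-based (Wilcoxon / Mann-Whitney) statistic. Values are made up.
in_cluster  = [3.1, 2.8, 3.5, 2.9, 3.3, 3.0]   # cells in the cluster
other_cells = [0.2, 0.0, 0.4, 0.1, 0.0, 0.3]   # background cells

# Mann-Whitney U for "in_cluster > other_cells": count the pairs
# where a cluster cell out-ranks a background cell
u = sum(1 for a in in_cluster for b in other_cells if a > b)
u_max = len(in_cluster) * len(other_cells)
# u == u_max means every cluster cell exceeds every background cell:
# the strongest possible rank-based evidence that this gene is a marker
```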
In ASAP, there are 2 ways of doing this. The easiest is to simply click on the "0 annotation" button next to a cluster in the Visualization panel. This will open the "Annotation" panel (which we will use in the next step). In this panel, the Evidences are the marker genes for the selected cluster. They are automatically computed for all clusters once you click on this button.
Alternatively, you can use the "Differential expression > New" step. It allows more fine-grained DE analysis, since you can also compare one cluster vs another (instead of one vs all other cells). But you will probably not need this for the current exercise.
For example, if you run it for cluster 3, you should see the following table. We can then visualize the expression of the top DE genes found for cluster 3 within our UMAP (from the annotation panel, you can simply click the "eye" icon in front of the gene name). In the following example, I picked the second gene, "grh".
This is the final and most tedious task, since it often requires manually checking the DE genes, and some expert knowledge:
On ASAP, you can use the "Annotate" button that we saw previously to annotate your clusters (using the "New annotation" button), either with an ontology term or with free text.
Question: Which cell types can you identify? Annotate your clusters with the curated cell types you identified.
Note 1: The DE genes found for each cluster vastly depend on the background cells (the other clusters). This means that if you have only a subset of cell types in your dataset, you may find many DE genes that seem to be specific to your cluster of interest, simply because they are not expressed in the other clusters (but they could be expressed in cell types that are not present in your dataset).
Note 2: There are methods for automatic prediction of clusters based on curated marker gene databases (e.g. Garnett), but they are not yet perfect (especially for rare cell types) and don’t have curated databases for all species/tissues. They are not yet implemented in ASAP.
Note 3: There is also a growing number of studies that instead perform “integration” of cell atlases with their dataset, in order to see which annotated cell types their clusters match. This requires a specific integration step at the beginning of the project, using the Seurat integration method or external methods such as Harmony.
Using this tutorial, you should now be able to run the basic ASAP pipeline on any dataset. Of course, there is no unique pipeline that will work on all datasets, and it needs to be adapted/tuned to each dataset's specificities. But at least this course gives you a canvas to start an analysis, and maybe go beyond it and perform more in-depth downstream analyses once you feel more confident.
Here, the tutorial is run entirely on the ASAP web portal, so that you don’t need to know R/Python to be able to run the analysis. But if you want to go deeper, then of course you will have to learn R or Python, and follow the Seurat or scanpy tutorials. There are good online books from the CRAN community, and more recent ones that work a bit like this tutorial. You can also attend online courses; in any case, I would strongly recommend following one of these courses/tutorials if you intend to perform single-cell analyses in the future.
Of course, it is possible to run this whole analysis in R without using the ASAP framework. An online book, “Orchestrating single-cell analysis with Bioconductor”, is an excellent resource for performing this. It can also be run in Python, using the scanpy package, if you prefer Python over R.